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SOLVING  LINEAR  ALGEBRAIC  EQUATIONS 
CAN  BE  INTERESTING 


1 

George  E.  Forsythe 

The  subject  of  this  talk  is  mathematically  a 
system  of  n linear  algebraic  equations  in  n 

Ax  - b 

Here  A is  a square  matrix  of  order  n,  whose  elements  are  given  real 

numbers  a^,.  with  a determinant  d(A)  / Oj  x and  b denote  column  vectors 

and  the  components  of  b are  given  real  numbers .(Complex  numbers 

would  offer  no  essential  difficulty.)  It  is  desired  to  calculate  the 

"1  -1 

components  of  the  unique  solution  x e A b$  here  A is  the  inverse 
of  A. 

Such  problems  arise  in  the  most  diverse  branches  of  science  and 
technology,  either  directly  (e.g.,  the  normal  equations  of  the  least- 
squares  adjustment  of  observations)  or  in  an  approximation  to  another 

A talk  delivered  (under  a similar  title)  before  the  Eugene  meet- 
ing of  the  American  Mathematical  Society,  June  21,  1952,  by  invitation 
of  the  Committee  to  Select  Hour  Speakers  for  Western  Sectional  Meet- 
ings j received  by  the  editors  , 1952. 

n 

Sponsored  in  part  by  the  Office  of  Naval  Research,  USN. 


1.  Introdue^on. 
lowly  one.  Consider  a 
unknowns,  written 

(1) 
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problem  (e.g.,  the  difference-equation  approximation  to  a self-adjoint 
boundary-value  problem  for  a partial  differential  equation) . These 
two  are  very  frequent  sources  of  numerical  systems 5 note  that  A > 0 
(i.e.,  A is  symmetric  and  positive  definite)  in  both  examples.  The 
order  n is  considered  to  range  from  perhaps  6 or  8 up  to  as  large  a 
number  as  can  be  handled.^ 

^Steam  [111],  for  instance,  mentions  the  solution  of  a svstem  of 
order  2300  by  the  U.S.  Coast  and  Geodetic  Survey.  The  accuracy  demanded 
of  an  approximate  solution  \ varies  widely;  even  the  function  which 
is  to  measure  the  accuracy  of  \ varies  or  is  unknown.  Some  "customers” 
want  to  make  the  length  |b  - a||  small;  some,  | \ - A '4>|;  others 
have  apparently  thought  only  in  terms  of  getting  A"d  exactly. 

We  all  know  that  each  component  of  the  solution  A ^b  can  be  ex- 
pressed as  a quotient  of  determinants  by  Cramer's  rule.  We  have  all 
evaluated  determinants  of  orders  3,  U,  and  possibly  £,  with  a^ 
integers;  it  is  quite  easy  and  rather  boring.  I therefore  suspect 
that  the  average  mathematician  damns  the  practical  solution  of  (l) 
as  being  both  trivial  and  dull. 

One  defense  of  the  numerical  analyst  (preface  of  [81])  is  to 
show  that  in  many  practical  cases  (say  for  decimal  fractions  a^  and 
n » 10)  getting  A "H)  efficiently  (or  at  all)  is  actually  not  trivial, 
but  demands  both  know-how  and  planning.  This  point  is  especially 
well  taken  against  those  whose  program  for  solving  (l)  would  be  to 
evaluate  n + 1 determinants  from  their  definition,  employing  nl  mul- 
tiplications per  determinant.  For  n “ 26,  for  example, 
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® Ofi  ' 

(n  + l)  l * 10  , a number  of  multiplications  which  would  take  the 

2 17  3 

SWAC  some  10  ' years.  Actually,  only  about  (l$)h:  multiplications 

2 , 

^National  Bureau  'of  Standards  Western  Automatic  Computer,  an 
electronic  machine  easily  capable  of  2000  multiplications  per  second. 
See  [6l]  for  a description,  now  out  of  date. 

f 

are  needed  to  solve  (1);  see  Bodewig  [12],  For  n ■ 26,  (i/^n^  * 6000. 
and  the  multiplications  would  take  the  SWAC  about  3 seconds. 

It  is  my  hope,  on  the  other  hand,  to  arouse  the  mathematician’s 

t, 

interest  by  showing  (section  2)  the  diversity  of  approaches  to  the 
solution  of  (l),  and  by  mentioning  (sections  3 to  6)  some  problems 
associated  with  selected  iterative  methods.  The  newest  process  on 
the  roster,  the  method  of  conjugate  gradients,  is  outlined  in  section 
7.  Section  8 touches  on  the  difficult  general  subject  of  errors  and 
"condition,"  while  a few  words  are  hazarded  in  section  9 about  the 
effect  of  machines  on  methods. 

Whether  or  not  the  subject  proves  interesting,  the  bibliography 
is  intended  to  make  the  field  look  at  least  respectable l It  is  a 
representative  condensation  of  the  5>00-odd  titles  in  the  author’s 
file,  most  of  which  are  in  [2y].  There  are  bibliographies  on  re- 
lated subjects  in  Collatz  [16] , Dwyer  [20],  Engineering  Research 
Associates  [22],  Frame  [36],  Frankel  [37] s Franklin  and  Hankam  [38a], 

<vV  AA<» 

Harvard  Computation  Laboratoiy  [U7],  Higgins  [3&],  KuroS,  Markusevlc, 
and  Rasevskii  [71],  Motzkin  [82],  Schwerdtfeger  [106],  and 
Taussky  [116]. 
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It  is  remarkable  how  little  is  really  understood  about  most  of 
the  methods  for  solving  (l) . They  are  being  used  nevertheless,  and 
yield  answers.  This  disparity  between  theory  and  practice  appears  to 
be  typical  of  the  gulf  between  the  science  and  the  art  of  numerical 
analysis. 


2.  Survey  of  methods.  It  usually  surprises  the  uninitiated  to 
learn  the  variety  of  methods  actually  used  to  solve  systems  (l)  . Sur- 
veys of  methods  are  given  by  Bodewig  [12],  Dwyer  [20],  Faddeeva  [23], 
Forsythe  [26] , Fox  [33,  3k] , Frazer,  Duncan  and  Collar  [39],  Hotelling 
[58] , Householder  [59,  60] , Zurmuhl  [131],  and  others,  but  the  spectrum 
of  known  methods  is  considerably  broader  than  any  of  these  surveys 
suggests.  A classification  of  elimination  methods  is  given  by  Jensen 
[gy.  A tentative  classification  of  all  known  methods,  with  a biblio- 
graphy of  about  U50  titles,  is  in  [27]  „ 


As  mentioned  in  section  1,  one  can  solve  a system  (l)  explicitly 
by  use  of  determinants,  while  explicit  solutions  in  other  forms  are 
sometimes  available^  Friedrich  and  Jenne  [U0],  for  example,  describe 


the  use  of  continued  fractions. 

The  best  known  methods  are  based  on  systematic  elimination  of 
unknowns  from  equations  in  the  system  (l)  in  the  fashion  of  high- 
school  algebra,  as  described  by  Gauss  [l|2] . The  elimination  amounts 
to  triangularizing  the  matrix  A by  premultiplying  it  by  a triangular 
matrix,  as  Banachiewicz  [5]  and  Turing  [122]  point  out.  The  process 

A-'”'—-  , — 

can  be  rendered  very  efficient  numerically  by  consolidating  opera- 
tions; see,  for  example,  Benoit  [7],  Dwyer  [20],  and  Turing  [122]. 


' 


* 


When  A is  positive  definite,  the  method  is  equivalent  to  the  succes- 
sive orthogonalization  of  the  unit  vectors  in  the  A metric  by  the 
Gram-Schmidt  process  [ 99] • With  other  metrics  the  orthogonalization 
process  yields  different  methods  for  solving  (l)  . All  these  elimina- 
tion methods  can  also  be  performed  on  any  matrix  of  submatrix  blocks 
formed  by  partitioning  A,  an  idea  suggested  by  Boltz  [131  and  based 
on  relations  given  by  Schur  [105>] . The  various  elimination  methods 


are  direct,  a term  defined  at  the  end  of  section  2. 

There  is  a group  of  direct  methods  related  to  the  characteristic 

polynomial  0 of  some  matrix.  For  example,  if  one  can  learn  that 

0(A)  s cnAn  + **•  + c^A  +1=0,  then,  as  Bingham  [8]  notes,  one  can 

compute  A~^b  * -cnAn  "Hs  - *°°  - CgAb  - c-jb.  Similar  remarks  apply 

when  0(H)  is  known  for  an  operator  H associated  with  the  solution  of 

(l)  j see  section  1;.  There  are  related  methods  involving  the  succes- 

2 n 

sive  orthogonalization  of  the  vectors  Axq,  A xq,  •••,  Axq  in  ihe 

-1 

I,  A,  A , or  other  metrics  $ one  of  these  is  given  in  section  7. 

There  is  an  unpublished  direct  method  of  Hans  Lewy  using  the 
theory  of  congruences  in  n dimensions,  applicable  mainly  when  the 
components  of  A~\>  are  integers.  It  is  based  on  the  use  of  stencils 
for  solving  the  system  A x ■ b , where  a. . » 0 (if  a. . is  even)  and 
a^  “ 1 (if  a^  is  odd),  and  where  b^  is  similarly  defined. 

Of  a quite  different  nature  is  a group  of  iterative  processes 
devised  by  Jacobi  [63],  Nekrasov  [88],  Richardson  [98],  and  others, 
and  subsumed  by  Cesari  [l£],  Wittmeyer  [12?],  and  Geiringer  [UU]  in 
a general  class  of  linear  iterative  processes  to  be  discussed  in 


. 


' 
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section  3.  In  these  methods  one  starts  with  an  arbitrary  first  vector 
xq  approximating  A-1b.  For  k * 0,  1,  2,  •••  , the  components  of  the 
k-th  approximant  vector  x^,  are  systematically  corrected,  often  in 
the  cyclic  order  1,  2,  ° ° - , n,  but  sometimes  in  blocks  or  otherwise. 
After  one  cycle  of  the  corrections  has  been  completed,  the  components 
of  the  resulting  vector  have  beai  obtained  by  solving  a 

matrix  equation  Bx^^  + Cx^  = b for  x^-^,  where  B + C “A.  The  dif- 
ferent linear  processes  are  distinguished  by  the  choice  of  B,  which 
must  be  an  easily  invertible  matrix.  For  example,  B can  be  the  lower 
triangle  of  A [88] , the  diagonal  of  A [63],  a sdalar  matrix  [98], 
diagonal  blocks  of  A (von  Mises  and  Pollazek-Geiringer  [123] , Hertwig 
[U9]),  etc.  (Wien  B is  the  lower  triangle  of  A,  the  iterative  method 
is  commonly  called  the  ’Seidel  process,'  or  the  ’Gauss-Seidel  process.’ 
But,  as  Ostrowski  [91]  points  out,  Seidel  [10 7]  mentions  the  process 
but  advocates  not  using  it.  Gauss  nowhere  mentions  it.  Nekrasov  [88] 
studies  its  properties  and  says  it  is  Seidel’s  process.  It  will 
henceforth  be  called  the  ’cyclic  single-step  process.")  Under  appro- 
priate conditions  mentioned  in  section  3,  x^.  approaches  A ^b,  but 
ordinarily  no  x^.  equals  A ^b. 

Another  group  of  iterative  processes  includes  those  of  Gauss 
[III],  Seidel  [IO7],  Southwell  [109,  110],  Motzkin  [2]^,  and  others. 

These  are  called  ’’relaxation  methods"  in  [109,  110 ] . They  are  dis- 
cussed by  Black  and  Southwell  [10],  Fox  [32],  Temple  [118] , and 
others,  and  have  proved  especially  useful  in  engineering  work.  They 
are  difficult  to  define  precisely,  since  the  computer  is  essen- 
tially advised  to  use  all  the  art  and  artifice  he  can  muster  to 
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find  x such  that  r « b - Ax  is  near  0.  Their  predominant  feature, 
however,,  is  that  the  components  of  x are  corrected,  not  in  a pre- 
determined order,  but  in  an  order  of  "worst  first  If  this  feature 
is  adopted  as  defining  relaxation,  the  iteration  function  (defined 
in  section  3)  depends  on  x in  a piecewise  linear  fashion,  and  the  ana- 
lytical character  of  the  processes  is  completely  different  from  that 
of  the  related  linear  processes  0 Relaxation  has  been  studied  recently 
in  connection  with  solving  systems  of  linear  inequalities ; see 
Agmon  [2],  and  Motzkin  and  Schoenberg  [83], 

Other  nonlinear  iterative  processes  include  the  least-squares 
iterative  methods  discussed  in  sections  5>,6,  and  7.  They  start  with 
Cauchy  [lU],  and  are  synthesized  by  Temple  [118],  Rosser  [unpublished], 
Hestenes  and  Stein  [53],  and  others*  Special  cases  include  certain 
linear  processes  which  essentially  deal  with  a positive  definite  matrix. 
For  example.  Kaczmarz  [67]  and  Tompkins  [120]  interpret  the  system 


(1)  as  restricting  A ’■Lb  simultaneously  to  n hyperplanes . A first 
guess  xq  is  projected  successively  on  each  hyperplane  in  cyclic  order. 
Then  the  distance  |x^  - A"°  | decreases  monotonically  to  0.  De  la 
Garza  [19]  proposes  a related  process.  When  A > 0,  the  cyclic 
single-step  method  and  all  the  relaxation  methods  are  also  least- 
squares  methods.  This  fact  is  the  basis  of  many  studies  of  the  relaxa- 
tion methodsj  see  [118]  and  Ostrowski  [93]. 

Least-squares  processes  which  are  not  linear  include  the  gradient 

,y 

methods  of  [llj.]  developed  in  [118],  and  by  Kantorovic  [68]  and  Birman  [9]. 
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see  section  !?.  Their  culmination  appears  to  be  the  conjugate  gradient 
method  of  Hestenes  [£0,  5»U] , Lanczos  [72]  ? and  Stiefel  [llh,  5k],  this 
is  a finite  iteration  described  in  section  7.  A gradient  method' in 
a more  general  metric  is  mentioned  in  section  6. 

Seemingly  very  different  are  the  Monte  'Carlo  methods  for  solving 
(l),  employing  random  sampling  of  a certain  chance  variable  whose  ex- 
pectation is  A \>.  One  such  method,  devised  by  von  Neumann  and 
Ulam  [unpublished],  is  described  by  Forsythe  and  Leibler  [28 ] 5 Wasow 
rig]  devises  another  Monte  Carlo  method.  Both  are  based  on  proper- 

t 

ties  of  discrete  random  walks  in  a space  of  n points.  When  the  system 
(l)  represents  Laplace's  or  related  difference  equations  in  one  or 
more  dimensions,  the  Monte  Carlo  methods  have  a longer  history;  see 
the  exposition  and  bibliography  in  Curtiss  [18].  The  methods  are 
theoretically  fascinating,  but  there  is  little  evidence  yet  of  their 
practical  utility  for  solving  linear  systans. 

The  iterative  processes  for  solving  (l)  are  likely  to  converge 
slowly,  and  a number  of  tricks  have  been  devised  to  speed  them  up, 
called  acceleration  processes.  Accelerations  of  linear  processes  may 
themselves  be  either  linear  or  nonlinear;  some  are  described  in 
section  U.  A number  of  processes  for  accelerating  the  nonlinear  gra- 
dient methods  are  mentioned  at  the  end  of  section  and  the  conjugate 
gradient  method  of  section  7 may  be  considered  also  as  an  acceleration. 

1 

The  distinction  between  direct  and  iterative  methods  is  ordi- 
narily stressed  in  numerical  analysis;  see  [23],  for  example.  Applied 
to  systems  (l),  a direct  method  is  one  which  yields  k~\>  exactly 
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after  a finite  number  of  arithmetical  operations,,  if  the  latter  are 
performed  without  round-off  error.  An  iterative  process  can  ordinarily 
yield  A~"ho  only  as  the  limit  of  a sequence  of  exact  arithmetical  opera- 
tions. However,,  it  must  be  remembered  that  as  soon  as  calculations 
are  rounded  off  (as  ordinarj^r  occurs  in  machine  calculation),  direct 
methods  disappear  except  for  quite  simple  problems,  and  all  methods 
become  iterative!  see  the  end  of  section  8.  The  practical  distinction 
between  methods  for  solving  (l)  then  appears  to  depend  on:  (a)  the 
speed  of  the  convergence  to  A~'H>,  and  (b)  the  simplicity  of  the  compu- 
tation at  each  stage  of  the  iteration.  A two-way  classification  of 
methods  might  well  be  based  on  their  behavior  with  respect  to  proper- 
ties (a)  and  (b) , and  that  of  [2£]  was  roughly  of  this  type.  Two 
difficulties  make  such  a classification  imprecise.  First,  the  theory 
of  round-off  error  is  too  poorly  developed  to  yield  rates  of  conver- 
gence to  A”^b.  Second,  the  practical  criteria  of  simplicity  in 
machine  computation  vary  too  greatly  among  different  machines. 

One  may  also  distinguish  whether  the  matrix  A is  altered  in  the 
course  of  solution,  as  in  elimination,  or  whether  it  is  retained  in 
its  original  form,  as  in  the  conjugate  gradient  process.  This  is 
probably  a crucial  distinction  in  practice,  for,  when  the  original  A 
is  retained,  the  round-off  errors  seem  to  accumulate  more  slowly! 
see  section  9. 


3*  Linear  iterative  processes.  An  iterative  process  for  solv- 
ing (l)  (or  other  equation)  is  determined  by  the  functions  where- 
with the  (k  + l)-th  approximant  to  A*^b,  xk+l-’ 


is  derived  from  the 
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earlier  approximants  XQ?Xp  •••  , x^..  If  the  only  argument  of  F^  is 
x,  s the  iterative  process  is  said  (Schroder  [103])  to  be  of  first 
degree;  “ F^(x^)  . If  the  function  F.^  is  independent  of  k,  the 

process  is  called  stationary . A bibliography  on  iteration  as  such  has 
been  prepared  by  Schwerdtf eger  [106] „ 

As  elsewhere  in  mathematics,  the  most  studied  functions  are  the 
linear  ones . We  Introduce  the  (most  general)  linear  iterative  process 
of  the  first  degree  by  the  definition 


(2) 


*k+!  " W ■ Vic  + \ > 


where  the  H^.  are  square  matrices  and  v^  are  vectors.  If  the  iterative 
process  described  by  (2)  is  to  solve  (l) , it  seems  essential  that  the 
solution  A ^b  be  a fixed  point  of  F^: 


(3) 


A J'b  “ F^(A  ^b)  « H^A  + v^ 


We  demand  that  (3)  hold*  It  follows  that 


(U) 


\ ■ (i  - v a“^  * v • 


If  Hj  and  M,  are  independent  of  b,  then  (6)  follows  from  (U)  . Thus: 

'RtJvx  — 

The  most  general  stationary  linear  iterative  process  of  the 
first  degree  for  solving  (l)  which  is  independent  of  b and  which 
satisfies  (3)  Is  defined  by  the  functions 


(*) 


Vi  ' W ■ Vic + V » 


where  the  square  matrices  and  depend  only  on  A and  satisfy  the  relation 


' 


’ 


■ 
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(6)  + \A  - I * 

If  the  process  is  stationary,  then 

(7)  ^+1  53  + Mb  , 

where  the  square  matrices  H and  M depend  only  on  A and  satisfy  the  relation 

(8)  H + MA  - 1 . 


Stationary  processes  of  the  type  (7),  (8)  have  been  studied  by 
Cesari  [15],  Wittmeyer  [127],  Geiringer  [UU],  and  others.  They  in- 
clude  the  cyclic  single-step  iteration  of  Nekrasov  [88]  and  Liebmann 
[7U],  and  those  of  Jacobi  [63],  Richardson  [98],  Frank el  [38],  Young 

AA^  /W  yw 

[129],  and  many  others.  In  the  cyclic  single-step  process,  for  exam- 
ple, one  writes  A as  the  sum  of  two  matrices  B “ (b.  .)  and  C ■ (c.  .), 


where 


r 


aij  (i  4 3) 


(i  <j) 


and  c.  . ® ' 
1J 


o (i  » j) 


aij  ^ < 3) 


Then  + CXy  b,  or  + B~o;  it  is  assumed  that  no 

a..  - 0.  Thus  H - - B_1C,  while  M - B"1. 

Let  us  consider  the  convergence  of  the  linear  processes,  under 
the  assumption  that  all  arithmetic  operations  are  carried  out  with 
perfect  accuracy.  From  (2)  and  (3)  it  follows  that 


(9) 


Vi - A_lb  ■ ■ A_lb)  > 


■ 


> 


. 

' 

" 
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whence 


(10)  2^  - iTh  -‘Kk(xo  - Ch)  , 

where  K^.  » ' ' * HiH0  * or<^er  that  - A™^  0 for 

arbitrary  xq  it  is  therefore  necessary  and  sufficient  that 

(11)  lira  K,  z - 0 , for  every  vector  z. 

k-»oo  K 

In  practice  condition  (11)  is  usually  known  to  hold  only  in 
certain  special  cases,  such  as  when? 

(a)  all  are  polynomials  in  some  one  matrix  - for  example,  A ; 

(b)  all  = H (stationary  process)  | 

(c)  with  some  norm  function  ||*||  , all  f|  H.  |J  6 1 - £ <1. 


Henceforward  we  consider  only  case  (b) : stationary  linear  process- 

es. Then  » H^,  and  it  is  known  (see  [ 9U],  for  example)  that  (ll) 
holds  if  and  only  if  each  eigenvalue  X^(H)  of  H is  less  than  one  in 
absolute  value.  Thus,  we  have  derived  the  well  known  result  that  in 
the  stationary  linear  process  x^  A“1b  for  all  xq  if  and  only  if 
all  | A ±(H)  | <1. 


This  result,  while  exceedingly  important,  hardly  begins  to  solve 
the  practical  computer’s  problem.  He  raises  the  following  questions: 
(i)  How  can  one  tell  whether  all  j X ^(H)  j < 1 ? (ii)  If  all 
| X^(H)  J < 1,  how  fast  will  the  convergence  be?  (iii)  If  the  conver- 
gence is  intolerably  slow  (as  the  computer  realistically  expects), 
how  may  it  be  speeded  up?  To  a considerable  extent  these  questions 


. 

■ 
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are  answered  by  a knowledge  of  the  asymptotic  behavior  of  the  error 
vector  - A ^b  as  k 4 oo„  Question  (iii)  will  also  be  dealt  with 
in  section  U. 

The  general  theory  of  the  Jordan  canonical  form  [78,  9U]  of  the 

A/  '\Sm* 

matrix  H can  be  applied  to  discuss  the  asymptotic  behavior  of 

xk  - A 1b.  Suppose,  as  is  usually  the  case,  that  H has  the  diagonal 

canonical  form 


(12) 


H 


A-c.r? 

111 


s 


T 

where  c^,  r^  are  respectively  the  column  and  row  eigenvectors  of  H 
belonging  to  . Suppose  that  xq  - A H>  * From  (10)  one 

then  finds  that 


(13) 


n 


" A"^  - Hk(x  - A-1b)  - J 

i5®! 


If  Xn  is  a unique  eigenvalue  of  maximum  modulus,  and  if  y / 0,  then 
(1U)  x^  - A~^b  - YnX^cn  (as  k -4  00) 

This  indicates  that  asymptotically  i A ^ along  a certain  line,  a 

fact  which  is  often  useful  in  accelerating  the  convergence  of 

When  An  is  unique  but  « 0,  the  relation  (lU)  does  not  hold 
for  exact  operations*  However,  a calculation  involving  round-off 
errors  will  soon  have  the  practical  effect  of  making  y^  / 0.  If 
several  eigenvalues  dominate  in  absolute  value,  formula  (lU)  must 
usually  be  modified  to  include  several  terms . In  any  case,  when  (12) 


' 


■ 


* 


* 


iu 


holds,  formula  (13)  shows  that  |x^  ~ A^^b  j approaches  zero  linearly  — 
i.e.,  with  the  order  of  the  k-th  power  of  rftax^|  A^|.  When  (12)  fails 
(i.e.,  when  H does  not  have  a diagonal  Jordan  canonical  form),  formula 
(13)  has  to  be  altered.  The  important  fact,  however,  is  that  the 
basic  mathematical  theory  is  already  in  existence  for  dealing  with 
linear  iterative  processes  — at  least  as  long  as  round-off  errors 
are  not  considered  in  detail.  (It  is  commonly  supposed  that  a non- 
diagonal canonical  form  would  never  arise  in  practical  analysis.  How- 
ever, if  one  uses  the  Liebmann  [?U]  process  to  solve  the  usual  differ- 
ence equation  corresponding  to  the  Dirichlet  problem  in  one  or  more 
dimensions,  the  matrix  H turns  out  to  have  a non-diagonal  canonical 
form.) 

Recent  developments  in  stationary  linear  processes  have  included 
a suggestion  of  Aitken  [U]  and  others  for  making  H > 0 (i.e.,  a posi- 
tive  definite  symmetric  matrix),  so  that  all  A^(h)  will  be  real  and 
positive,  insuring  that  (13)  will  in  fact  be  dominated  by  one  term. 

In  the  cyclic  single-step  process  this  can  be  achieved  by  solving  for 
the  components  in  the  " to-and-fro*'  order  1,  29  •••  , n,  n - 1,  •••  , 2 
and  repeat. 

1 

According  to  Ostrowski  [91],  for  the  cyclic  single-step  prbcess  it 
was  first  proved  by  Pizzetti  [95>]  that,  if  A > 0,  all  i A ±(H) ! < 1.  A 
converse  recently  proved  by  Reich  [97],  and  more  simply  by  Ostrowski 

r\s~^  1 

[92],  asserts  that  if  A is  symmetric  and  each  a^.^  > 0,  then  all 

v'VV*  ““  — — “ H 

| X^(H)  I < 1 if  and  only  if  A > 0.  Collatz  [17] > Stein  and  Rosen- 
berg [113],  and  others  have  studied  the  relations  between  the  A ^ (H) 


. 


l'  ; 

. 
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for  the  cyclic  single-step  process  and  those  for  the  related  total-step 
process  of  Jacobi  [63].  Other  recent  developments  [38.  129]  include 
the  alteration  of  H by  simple  devices  so  that  max^j A is  reduced  as 
much  as  possible.  In  the  Liebmann  process  for  the  Dirichlet  problem, 
a certain  systematic  "over-relaxing"  has  just  this  effect. 

U.  Acceleration  of  ^tatioiw^-  linear  processes.  Since  ordinarily 
the  stationary  linear  process  (?)  seems  to  converge  slowly,  or  even 
diverge,  it  is  of  the  greatest  practical  importance  to  devise  methods 
to  improve  the  convergence.  Any  procedure  which  replaces  by  a 
vector  closer  in  some  sense  to  A=^b  is  loosely  called  an  acceleration 
of  the  iterative  process  (7)  . An  acceleration  is  ordinarily  a 
summability 

best  understood  accelerations  are,  like  the  iterations  themselves, 
linear  processes,  in  which  the  improved  vector  is  a linear  combination 

of  V xl>  '**  > V 

The  simplest  linear  acceleration  is  useful  when  H has  a unique 
dominant  eigenvalue  An?  By  (lU),for  large  k 


applied  to  the  sequence  defined  by  (13)  . The 


05) 


Vi  ' A_lb  1 Xn(xk  " A_lb)  • 


Hence  the  formula 

(16) 


-L  . y“k*i"  * nSc 

D a 


n 


may  be  expected  to  give  a better  approximation  to  A "Hd  than  gives, 
even  when  An  is  known  only  approximately.  This  idea  has  been  ap- 
plied by  Flanders  and  Shortley  [2k],  Lyusternik  [77]*  and  others. 
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The  same  idea  can  be  extended  to  take  care  of  two  or  more  dominant t 
A..  Let  E be  the  operator  increasing  by  one  the  subscript  k of  x: 
x^+^  * Ex^.  One  can  write  (16)  in  the  form 

(17)  » P1(E)  ^ , 

where  P,  (E)  " (1  -■  A (E  - A ) is  a polynomial  of  degree  one  in 

X XI  XX 

E • Let  m be  an  integer . Consider  the  formula 

(18)  A-1b  i Pm(E)  Xy  , 

where  P^  is  some  polynomial  of  degree  m.  How  shall  we  determine  P^ 
so  that  (18)  is  a good  approximation?  If  we  can  answer  this  effec- 
tively, then  (18)  is  a useful  linear  acceleration  of  the  process  (7). 

In  order  that  the  operation  (l8)  not  worsen  an  already  good  appro  xi 
mation  x^,  we  must  have 

(19)  P(l)  “ 1 . 

m 

For  if  xk  - pT\}  then, by  (3),  x^+1  * xk+2  - •••  - k~\,  so  that 

VE)  \ *k ' p«(1)  ch>- 

If  m - n ■ the  order  of  H,  and  if  0(A)  is  the  characteristic 
polynomial  of  H,  then  the  choice  P^(E)  * 0(E) /0(l)  is  perfect,  in 
that  Pm(E)  x^  is  exactly  equal  to 

0 is  unknown,  and  to  obtain  it  would  involve  precise  knowledge  of 
all  the!  eigenvalues  A^  of  H. 

Suppose  that  the  eigenvalues  although  not  known  precisely, 

are  known  all  to  be  in  a certain  closed  set  R of  the  complex  plane 


for  all  x^.  But  ordinarily 


. >. 


■ 
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(for  example,  a real  interval).  By  (8),  since  d(A)  ^ 0,  the  number  1 
cannot  be  an  eigenvalue  of  H.  Hence  xire  may  assume  that  ft  is  bounded 
away  from  the  point  1.  Now  the  eigenvalues  of  Pm(H)  are 
Since,  by  (13), 


Pm(E)  ^ - A~\  - Pm(H)(x^  - , ! 0 

it  is  essential  for  the  success  of  the  acceleration  that  all  |P  (A.)  I 

1 m i 1 

be  small.  This  leads  to  the  approximation  problem  of  determining  the 
polynomial  Pm  of  degree  m such  that  (19)  holds  and  such  that 


(20) 


max  |pj  A ) I is  a minimum 
X£R  “ 


Such  polynomials,  named  generically  after  Cebysev  (*  Chebyshev  * 
Tschebyscheff  * • • •)  arise  frequently  in  numerical  analysis. 

If  R is  a real  interval  not  containing  1,  the  minimizing  polynomial 
P^(X)  is  proportional  to  the  ordinary  Cebysev  polynomial  T ( A ) for 
the  interval  Rj  see  W,  Markoff  [79].  For  just  this  reason  the  poly- 
nomials  T^  have  been  used  several  times  in  matrix  problems  to  suppress 
the  effects  of  unwanted  eigenvalues;  see  Abramov  [1],  Flanders  and 


Shortley  [ 2U3 , Gavurin  [b3],  and  Lanczos  [73]*  A recent  review  of 

y'  ^ 

Cebysev  approximation  problems  in  the  complex  plane  is  in  [80]. 


The  real  difficulty  in  numerical  analysis  is  that  R is  not  knoxm. 

How  can  the  information  gained  from  computing  xq,  x^,  •••  , x^.,  •••  , x^+m 


V y/ 

be  used  effectively  to  localize  R,  so  that  the  Cebysev  problem  can 
be  formulated?  After  R is  localized,  how  can  the  corresponding 


. 


. 


- 


' 
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P^(E)  be  determined  approximately?  These  are  important  questions 
which  are  in  need  of  treatment.  'The  use  of  symmetric  matrices  H helps 
a great  deal,  by  restricting  R to  the  real  axis. 

Another  acceleration  device  is  due  to  Schulz  [lOU],  Suppose 
A » I - H,  and  that  (2l)  converges,  A linear  process  analogous  to 
(7)  obtains  A ^ as  the  sum  of  a Neumann  series 

(21)  (I  - hF1  - I + H + H2  + •••  , 


often  slow  to  converge.  By  [10U]  one  can  obtain  the  partial  sum 

n 

X =I  + H+"'*+H2  1 of  (21)  by  n iterations  (22) s 


(22) 


Xk*l  ’ Xk<21  - V » Xo'1  • 


a total  of  2n  matrix  multiplications.  The  use  of  (22)  to  get  A is 
sometimes  called  Newton’s  process.  Can  this  idea  be  adapted  to  solving 
the  system  (l)  without  getting  A~^  first? 

For  accelerating  the  convergence  of  any  linear  process  one  also 
has  the  S ^-process  of  Aitken  [3$  U]  and  its  extensions  by  Shanks  [108] 

rv/-. 

and  Samuelson  [101],  Lubkin  [76]  has  studied  it  as  a nonlinear  se- 
quence- to-  sequence  summability  process.  It  requires  no  knowledge  of 
the  A . (H)  . Let  y.  represent  an  arbitrary,  but  fixed,,  component  of 
x^..  Then  the  functional  character  of  y^  in  a linear  process  is 
given  by 


(2  3) 


n 


lx 


00 


i=l 


+ V 8.  A? 

1 1 


where  y^  is  the  desired  component  of  A "Sd,  and  s^  are  numbers. 
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To  determine  y from  (23)  it  is  theoretically  sufficient  to  have  ex- 
actly  2a  + 1 successive  values  of  y^  — for  instance, 
y , y , **•  , ^n9  Prac^ce  the  elimination  of  the  and  A^ 
would  be  too  tedious,  but  frequently  one  A^?  say  > , predominates 

in  (23)  , and  for  moderately  large  k one  may  ignore  s^,  for 

2 ' 
i “ 1,  , n - 1.  In  the  8 -process  one  assumes  that,  with  suffi- 

V 

cient  accuracy,  y^.  ® y^  + sn  An5  and  calculates  y ^ from  three  suc- 
cessive y^.  The  formula  is  thiss 

¥w'Vi  _ (V2 

y»  ■ ' 7k " 1^7  • 


In  [108],  m exponential  terms  are  eliminated  from  (23)  at  once. 
If  m « 2,  for  example,  one  can  show  that 


f?.  Least^squares  methods.  A variational  approach  to  solving  (l) 
is  very  popular  now.  In  a general  treatment  over  the  complex  field 
(the  present  treatment  is  confined  to  the  real  field),  Hestenes  and 
Stein  [£3]  take  a matrix  R > 0 and  let  (2  Rz)2  » |z|R  be  the  R-length 
of  a vector  z.  (The  T stands  for  transposition.)  If  B " A RA  and 
c ■ A^Rb,  then  B > 0 also.  Let  the  deviation  of  x from  A be 


measured  by 


. 


. 


? ' » ' - ..  ■:  fcfci-rtt- 
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(2U) 


f(x)  * | Ax  - b|^  » |x  - A=1b|g 
* xTBx  - 2xTc  + jb  |j? 


Starting  from  an  initial  vector  x0,  one  can  move  x in  various  direc- 
tions with  the  object  <f  minimizing  f(x).  Clearly  the  minimum  is  at- 
tained just  when  x - A~'!'b. 

To  simplify  the  exposition  we  now  assume  A > 0 and  take  R * A-^. 


Then 


(25>)  f (x)  - xTAx  - 2xTb  + bTA~1b  . 

Although  f (x)  is  not  computable  unless  A "Ha  is  known,  it  is  sufficient 
in  practice  to  minimize  the  computable  function  f(x)  - b^A”''‘b.  Fix  x, 
and  let  r » b - Ax  be  the  residual  of  the  systan  (l)  at  x.  Let  d / 0 
determine  a direction.  Since  f(x  + °<  d)  ■ °<-  d Ad  - 2<xd  r + f(x), 
the  value  of  the  real  parameter  for  which  f(x  + ^-d)  is  a minimum  is 


(26) 


oCf  - d^r/d~Ad 


this  is  called  the  optimum  oC  (corresponding  to  x and  d) . 

For  any  ©C  » one  can  compute  f(x  + <vd)  from  the  relation 

f(x)  - f (x  + d)  - (2«- - ot^)d^Ad,  i.e.,  f(x)  - f(x  + /3°<^d)  - 
(3(2—  /S)«-’r^dTAd.  Thus  (when  d^r  / 0),  f(x  + ^^^d)  < f(x)  for 
0 < /3  < 2.  The  greatest  reduction  in  f(x)  comes  when  - 1. 

i 

We  now  describe  a general  least-squares  iterative  process  for 
solving  (l) . There  must  be  prescribed?  (i)  a start  xo$  (ii)  a se- 
quence of  directions  ; (iii)  a sequence  of  ratios  W- 

(In  but  not  here,  R must  also  be  prescribed.)  For  each 


- 


h 

. 
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k a 0,  1^  2,  ***  j one  determines  °C  ® by  (26)  so  that  f(x^.  + ^ d^) 
is  minimized.  Then  one  lets 


(27) 


Vi  m\  * 'V'fo  > 


T / T 

where  r^  - b - Ax^.  and  “ - d^r^/d^Ad^.  (It  is  not 
the  A and  the 
If  the  sequences 
in  [f>3]  that,  independently  of  xq,  f(x^)  -*-»  f(A*”'Lb), 
that  » A "H).  The  conditions  are; 


d^,  be  determined  a priori;  they  may 
and  fdjT  satisfy  (28)  and 


specified  that 
depend  on  x^.) 
(29)  it  is  shown 
as  k — » co,  so 


(28)  0 < &1  < /3y  4 2 - ^ < 2 (all  k)  } 


(2?)  0 < S2  - |d^rk|/(|djc|-!rk|)  (all  k)  . 

In  [E>3]  an  alternate  to  (28)  states  that  the  d,  recurrently  span  the 
space  in  a certain  uniform  manner. 

Among  others,  the  following  two  least-squares  processes  are  also 
linear  iterative  processes;  (a)  the  and  are  independent  of 
the  x^j  (b)  d^  r^  but  & ^ * ^k°^k  * a constant.  Wien  in 

(a)  all  /?k  ® 1 and  the  d^.  are  the  coordinate  unit  vectors  in  cyclic 
order,  one  has  the  cyclic  single-step  process  of  Nekrasov  [88],  The 
use  of  f(x)  is  veiy  useful  in  studying  any  single-step  process.  In 

(b)  one  can  write  the  linear  iteration  function  of  section  3 in  the 
form  F(x)  * x + ol( b - Ax)  = (I  - °CA)x  + b;  the  process  is  due  to 
Richardson  [98]  and  to  von  Mises  and  Pollazek-Geiringer  [123],  and 
converges  whenever  0 < °^  < 2/max.  ^ .(a). 

ZL  ZL 
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In  the  general  case,  however,  the  least-squares  process  is  non- 
linear. When  83  r^  53  - \ grad,  f(x^),  it  is  called  a gradient 
method  (or  method  of  steepest  descent)  . When  f3^  3 1,  one  has  the  optimum 
gradient  method  (since  f(x^  + d^)  is  minimized  as  a function  of  ex), 

proposed  by  Cauchy  [lU]  and  studied  by  Temple  [118],  Kantorovic  [68], 
Birman  [9] s and  by  Hestenes  and  Karush  [5>2]  for  the  eigenvalue  problem. 
Some  variations  of  the  method  are  treated  by  Krasnosel’skii  and 
Krein  [70a]. 

By  (2£)  the  surfaces  f(x)  ■ constant  are  similar  and  similarly 
situated  ellipsoids  whose  common  center  A “4)  we  seek.  Ary  approxi- 
mant  x^.  lies  on  a certain  ellipsoid  of  the  family.  The  gradient, 

-2r^,  lies  in  the  normal  to  at  x^,  Now  x^^  is  the  unique 

point  of  (x^.)  for  which  f(x)  is  minimized.  Since  f(x)  is  a quad- 
ratic function  of  distance  along  the  normal,  x^.+^  is  located  halfway 
between  the  two  intersections  of  '^C25^  “with  S^.  Moreover,  x^+-^  i-s 
the  point  where  ITj  (x^.)  is  tangent  to  an  ellipsoid  of  the  family. 

Let  0 < A * * - 

[68]  shows  that 


*=  be  the  eigenvalues  of  A.  Kantorovic 


(30) 


Fi 


< 1 


(all  x^) 


From  this  it  follows  that  f(x^)  i 0,  so  that  x^.  A ^b  and  the  pro- 
cess converges.  As  we  mentioned  in  section  3}  the  practical  computer’s 
problem  merely  begins  ■with  such  knowledge.  Some  experience  convinces 
him  that  (a)  is  frequently  very  close  to  1,  and  (b)  after  a few 


_ •' 


. 
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steps  [f  (x^.+^  )/f  becomes  and  remains  very  close  to  see  [25]  . 

Observation  (a)  is  borne  out  by  the  remark  [6 ^ p.  591  that,  for  cer- 

m 

tain  matrices  A of  type  C1C,  X n/ A ^ is  likely  to  be  of  the  order  of 
2 -2 

n , whence  1 - **  n . For  finite-difference  approximations  to 

the  Laplace  operator  over  rectangular  regions  Frankel  [38]  shows  that 

^ , where  n is  the  number  of  lattice  points.  As  to  (b)  , 

2 

for  any  A the  value  can  always  be  attained  by  ^(xjc+^)/;8(xic)  when 
\ ^ assumes  certain  directions  in  the  plane  spanned  by  and 

un  (defined  below).  From  (a)  and  (b)  we  conclude  that  the  optimum 
gradient  method  is  frequently  too  slow  for  practical  use. 

As  a guide  to  the  possible  acceleration  of  the  method  it  would 
be  valuable  to  know  the  asymptotic  behavior  of  x^  - A~^b,  if  arith- 
metical operations  are  done  exactly.  But,  because  x^+^  is  obtained 
from  by  a rational  cubic  transformation,  theorems  are  hard  to 
prove*.  It  is  a conjecture  of  Forsythe  and  Motzkin  [30],  proved  only 
for  n • 3 in  [29],  that  in  the  optimum  gradient  method  the  error 
vector  x^.  - A ^b  is  asymptotically  a linear  combination  of  the  eigen- 
vectors un,  u^  of  A belonging  to  the  largest  ( X and  least  ( X 
eigenvalue  of  A.  (If  there  are  eigenvectors  of  A orthogonal  to 
xq  - A~^b,  one  disregards  the  corresponding  eigenvalues  in  determin- 
ing X and  Xn„)  A proof  of  the  conjecture  for  n ■ U would  be  very 
desirable  because,  when  the  conjectured  asymptotic  relationship 
holds,  for  all  sufficiently  large  k the  points  x^,  x^-^,  xk+2f 
A ^b  are  asymptotically  coplanar.  Thus  one  could  accelerate  the 
convergence  of  the  optimum  gradient  method  by  occasionally  minimizing 


. 

■ 
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* 
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" 

. 
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f(x)  in  the  plane  through  the  endpoints  of  x^.,  and  x^+ ^ e The 

method  has  been  used  successfully' in  experiments  [25]  for  n * 6 on 

•~\S\ 


an  IBM  Card-Programmed  Calculator,  where  the  average  value  of 
f (Xk+i)/f (x^)  over  a series  of  about  100  steps  was  reduced  from  .9733 
(optimum  gradient  method)  to  ,62[|5  (optimum  gradient  method  with  ac- 
celeration every  ninth  step)  . 

A second  idea  to  speed  up  the  optimum  gradient  method,  the 
ft -optimum  method  proposed  by  Hartree  [U6]  and  by  Hestenes  and  Stein 
[£3],  is  to  give  ft,  some  constant  value  ft  in  the  range  0 < ft  < 1. 

In  the  test  problem  of  [25],  Stein  [112]  finds  ft  “ 0.9  to  be  approxi- 

r\s\j 

mately  best,  and  with  it  the  average  value  of  f(x^.+^)/f (x^.)  is  found 
[2^]  to  be  .820U|  in  [112]  .8065  is  found  for  a shorter  run.  Although 


/\ry/ 

the  convergence  of  the  ft -optimum  method  is  slower  (in  this  test)  than 

that  of  the  accelerated  optimum  gradient  method,  the  former  has  the 

considerable  advantage  of  being  shorter  to  code  for  automatic  machinery. 

The  success  of  the  ft  -optimum  method  is  perhaps  due  to  an  inherent 

instability  of  the  transformations . In  2 dimensions  the  transforma- 

-1 

to  ~ A b has  no  stable  fixed  directions  when 

ft  is  slightly  less  than  1 [Motzkin  and  the  author,  unpublished] . 

6.  General  descent  methods.  The  function  f of  (2h)  has  the 


tion  of 


*k 


following  significant  properties:  f(A  ^b)  ■ 0|  f(x)  > 0 if  x / A \>J 

f is  a convex  function  of  x.  Any  f with  these  properties  might  serve 
as  the  basis  for  what  Professor  Motzkin  calls  a descent  method.  In  such 
a method  one  has  a first  guess  xq,  and  sequences  {v}  and  {X}  as 
in  section  5.  As  before,  one  finds  a £ minimizing  f(x^  + « d^) , 
and  selects  x^^  » x^.  + /d^  ^d^ . 
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Other  suitable  f would  be 


n 


(31) 


(32) 


f(x)  - max  Jiv 


i 


Methods  employing  the  latter  norm  functions  f,  with  d^  “ - grad  fCx^.), 


are  somewhat  related  to  - but  apparently  do  not  include  - the  piece- 
wise  linear  iterative  processes.  Agmon  [2]  discusses  a relaxation 


method  for  linear  inequalities  from  somewhat  this  point  of  view. 
Zuhovickii  [130]  gives  a gradient  method  in  the  metric  (32)  for 
minimizing  f(x)  for  incompatible  systems  Ax  ■ b. 

7.  Method  of  conjugate  gradients.  It  is  easy  to  show  that  the 
acceleration  step  of  [25>],  discussed  in  section  5,  is  equivalent  to 
finding  the  unique  point  xf  for  which  f(x')  - f(x^  + Qr^  + ^Ar^) 

is  minimized  as  a function  of  the  two  parameters  <x  q and  This 

V’ 

suggests  a generalization  to  p parameters  discussed  by  Kantorovic 

[68]  and  Birman  [9].  Let  x be  any  point,  and  let  r * b - Ax  . De- 

fine  f(x)  by  (25).  Let  if  (xQ)  be  the  p-space  of  points 

x + c<r  + d,Ar  + ...  + oc  xr  . where  &•••&-  _ are 

o o o 1 o p— 1 o 5 o’  3 p—1 

real  parameters.  Let  x-,  be  the  unique  point  in  ft  for  which  f(x) 
is  minimizeds 


To  determine  x^  one  has  to  solve  a linear  system  of  p equations  in 

I 

the  p unknowns  9 •••  , . , . 


. 


. 


. 

. 
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Define  A ....  A as  in  section  5>  • Let  the  interval 
3 n 

A 4 A ^ A be  transformed  into  the  interval  - 1 4 t 4 1 by  the 
j.  n d 

affine  transformation  t ® Z ( A ),  carrying  into  -1,  and  into  1. 

Then  Z (0)  - S * ( X n + A^)(  Ar  - A^)  Let  Tn(t)  be  the  ordinary 
Cebysev  polynomial  of  degree  n,  normalized  so  that  Tn(l)  ■ max  ]_4t6l  I Tn(t)  I 
™ 1.  In  these  notations  Birman  [9]  has  proved  that,  for  each  A, 

l ~~ 

'2 

' TTT7  ‘ Pp  < 1 (al1  xo>  • 

i 1*' 


(3b) 


f(Xj) 


Thus  (3b)  is  the  extension  to  p > 1 of  (30) . 

Let  t^  ■ tT(X^),  so  that  t^  “1,  tn  “ -1.  For  a given  A,  the 
value  pp  cannot  necessarily  be  attained  by  |?(x^)/f (xQ)p . It  -will  be 
attained,  however,  for  any  A with  eigenvalues  for  which  the  corres- 
ponding t.  include  all  the  p + 1 values  t with  |T  (t)  | ■»  1.  For  such 

i P 

A,  the  xq  for  which  [f(x-^)/f  (xQ)  ]2  = p are  in  the  subspace  spanned  by- 
just  those  eigenvectors  u^  belonging  to  the  A ^ for  which 

XT  ( A 3 | “ 1.  The  case  p = 1 is  exceptional,  in  that  the  maximum 
jj^  is  always  attained  in  (3b)  m (30),  just  because  the  two  values 
t ■ - 1 where  |T-^(t)  j =*  1 are  necessarily  Z -images  of  eigenvalues. 

For  any  fixed  integer  p 4 1 the  above  p-step  process  can  be 
iterated  to  yield  a convergent  procedure  for  finding  A ^b.  Namely, 
for  k “ 0,  1,  •••  , one  obtains  as  ihe  unique  point  x in 

VV  for  which  f(x)  is  minimized.  The  optimum  gradient  method  of 
section  £ is  the  case  p a 1,  If  p < < n we  may  expect  that  there 
are  eigenvalues  A . of  A close  to  the  p + 1 points  where 


■ 

■ 


.■ 


- 


. 
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| C XT  ( A 3 | “ 1,  and  hence  that  the  value  of  (3h)  is  almost  at- 
tained for  certain  x . It  is  then  to  be  expected  that,  for  most  x , 

1 

[f  (xk+1)/f(xk)]2  will  be  approximately  p for  all  large  k.  Moreover, 

if  Xx  <<  An,  then  £ is  near  1 and  p^  » 1 - 2p^(  X An)  . Thus 

the  minimization  in  p dimensions  may  be  expected  asymptotically  to 
2 

proceed  p times  as  fast  as  the  optimum  gradient  method  (p  3 l) , when 

2 

p <<  n.  When  A /A-,  “ n , as  considered  in  section  !?,  the  iterated 

j.1 

p-dimensional  minimization  in  n-spaoe  may  be  expected  to  converge  like 
the  optimum  gradient  method  in  n/p  dimensions. 

The  true  asymptotic  behavior  of  x^  - A is  unknown.  Does  the 
vector  x^.  - A ^b  asymptotically  lie  in  a certain  (p  + l) -dimensional 
subspace,  as  is  conjectured  for  p = 1? 

Because  the  above  iterative  process  requires  at  each  step  solving 
a linear  system  of  p-th  order,  up  to  very  recently  the  method  was  con- 
sidered practical  only  for  p <<  n.  For  p * n,  in  particular,  it 
appeared  that  determining  the  minimizing  o( ^ in  (33)  would  involve 
solving  a linear  system  quite  as  difficult  as  Ax  * b.  Then,  about 
19^1,  Stiefel  [111;] , Lanczos  [72],  and  Hestenes  [!?0],  working  inde- 
pendently at  first,  all  discovered  that  (33)  can  be  minimized  by 
p repetitions  of  a beautifully  simple  algorithm.  By  taking  p - n 
one  actually  finds  A_^b  in  n steps,  except  for  round-off  errors.  The 
resulting  conjugate  gradient  method  is  thus  a typical  finite  iteration. 
An  extended  exposition  is  in  [5>.H]?  while  brief  expositions  are  given 
by  Rosser  [100],  Householder  [£9],  and  Taussky  and  Todd  [117], 
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The  conjugate  gradient  method  is  a nonlinear  stationary  iterative 
process.  The  first  approximant,  is  arbitrary one  takes 
p©  “ rQ  53  b •=  Axq.  For  ary  k “ 0,  assume  the  vectors  x^.,  rk,  p^  have 
been  determined,,  Then  rk+ls  ^k+1  are  determined  in  order  as 

follows i 


(35) 


“k  ■ rW^k  s 

Vi  ■ \ * “Vk  5 
Vl  - rk  - « kApk  s 

\ “ -rk+lAPk/PkAPk  5 

Vi  ■ rk*i  + Vk  • 


Here  r^  a b - Ax^s  and  the  significance  of  p^,  °c^9  fly  will  appear 

below.  In  the  absence  of  round-off  errors  xn  “ A ^b|  if  round-off 

errors  make  xR  ^ A”^b5  one  has  merely  to  carry  the  algorithm  on  for 

k • n + 1,  n + 2a  , until  sufficient  accuracy  is  attained. 

The  kernel  of  a number  of  methods  of  solving  Ax  - b for  A > 0 is 

the  determination  of  a set  of  n directions  (k  • 0,  •••  , n 1) 

T 

which  are  conjugate  (A-orthogonal)  in  the  sense  that  p.Ap.  ■ 0 for 
i / j . If  the  £ p^j-  are  known,,  then 


(36) 


b 


E (Pkb/P^Apk)pk  . 
k*o 


A convenient  method  to  get  the  is  to  apply  the  Gram-Schmidt  pro- 
cess [99]  of  successively  orthogonalizing  some  set  of  n linearly 


29 


independent  vectors  | . In  Gaussian  elimination  (pivotal  condensa- 

tion) the  v^.  are  taken  to  be  the  n coordinate  unit  vectors,  as  Fox, 
Huskey,  and  Wilkinson  [35]  discovered,  and  the  coefficients  defining 
the  orthogonalization  build  up  a triangular  matrix. 

In  the  method  of  conjugate  gradients  the  v^  are  the  vectors 
rk  “ h - Ax^.  (k  s 0,  , r - 1)  • The  beauty  of  this  choice  is  that 

rk+l  turns  out  to  be  automatically  conjugate  to  pQ,  p^,  •••  , p^. 

In  picking  P^+^_  it  is  therefore  necessary  only  to  alter  r^.+^  in  the 
direction  p^|  the  calculation  of  /3  and  in  (35)  has  this  object. 

The  calculation  of  and  is  an  iterative  procedure  for  building 
up  the  sum  (36)  . 

Recalling  that  x actually  minimizes  (33),  we  see  that  in  prac- 

Jr 

tice  the  residual  r_,  may  become  so  small  for  some  p < n that  it  is  un- 
necessary even  to  complete  the  n steps  theoretically  required  to  compute 
A ^b.  This  occurred  for  p « 90  in  a successful  calculation  with  n - 105 
of  a difficult  stress  problem  reported  by  Hochstrasser  [56]  and  Stiefel 
• [111*]  • Such  a saving  could  hardly  occur  with  Gaussian  elimination, 


since  the  unit  vectors  have  no  such  intimate  connection  with  the 
system  (l)  as  do  the  vectors  r^. 

The  conjugate  gradient  method  has  a geometrical  interpretation. 

As  in  section  one  seeks  the  common  center  A of  the  ellipsoids 
f(x)  88  constant.  Let  xq  be  prescribed.  One  gets  x^  by  performing 
one  optimum  gradient  step  (section  5)  in  the  unrestricted  n-dimensional 
space  R . Recall  the  definition  of  ) . There  is  an  (n  - 1) -dimen- 

sional affine  subspace  Rn_-j_  passing  through  x^  and  conjugate  to  ft"^(xo) . 


. 


, 
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The  solution  A lies  in  Rn=t^«  One  gets  by  taking  an  optimum 
gradient  step  -within  R^  The  gradient  p^  of  f(x)  at  x^  within 
Rn„i  is  the  projection  into  Rn  ^ of  the  gradient  r.^  of  f(x)  at  x^ 
within  R^.  The  optimal  point  in  the  gradient  direction  p^  from  x^  is 
X2  “ Similarly,  one  gets  x^.+^  by  taking  one  optimum  gra- 

dient step  from  x^.  within  the  (n  - k)- dimensional  affine  subspace 

Rn=k  conjugate  to  iy^(xo) . Finally,  Rq  is  the  solution 

-1 

point  A b . 

This  is  a bare  description  of  the  method „ In  [5b]  Hestenes  and 


Stiefel  give  an  amazing  number  of  its  properties,  discuss  its  applica- 
tion to  unsymmetrical  systems  (l),  and  so  on.  A few  machine  experi- 
ments [5l9  56s  5l 9 Hit]  with  the  method  suggest  good  stability  with 

/S.A. 

respect  to  round-off  errors,  but  a theoretical  study  of  the  stability 
would  be  desirable. 

The  conjugate  gradient  method  can  theoretically  be  applied  to 
solving  a system  Ax  ® b,  where  A is  a bounded  positive-definite  self- 
adjoint  operator  on  a Hilbert  space.  One  defines  f(x)  = (x.  Ax) 

- 2(x,  b)  + (b,  A= 1b) j the  inverse  operator  A~^  certainly  exists. 

The  method  will  ordinarily  no  longer  converge  in  n steps,  but  the  asymp- 
totic behavior  of  f(x  ) can  be  discussed.  Karush  [70]  shows  that  if 
A «*  oi  I + K,  where  K is  completely  continuous  and  oc  ^ 0,  then  f (x  ) 
goes  to  0 faster  than  the  p-th  term  of  any  geometrical  progression. 

Hayes  [U8]  treats  a general  A with  lower  upper  bounds 

m,  M (0  < m <M  < 00),  and  proves  that  f(x^)  “ [1  - (m/M) ]^f (xq) . 

More  can  be  proved ; 


' 
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Let  S - (M  + m)(M  - m)”'L.  The  Birman  inequality  (3U)  shows  that 
[f(x  )/f(x  )]2  - l/T  ( B)  < 1.  Hence , since  2T  ( 8 ) - [£  + Vi>2  - l]p 


+ [S 
(37) 


, one  gets  the  estimate 

2i/p 


f(x  ) 

rsr 


_1 

2p 


s Vi2-i  s + 


In  another  paper  the  asymptotic  nature  of[f(x  will  be  de- 

P 

scribed  more  precisely  for  a class  of  operators  A with  a continuous 
spectrum. 

8 . Errors  and  " condition .»  One  must  say  something  about  the 
important  but  little-understood  subject  of  errors.  We  may  distinguish 
between:  I.  errors  committed  in  the  course  of  solving  the  system  by 

a specific  algorithm!  H.  errors  inherent  in  the  system  Ax  « b . 

Within  I one  is  concerned  with  the  truncation  errors  and  the 


round-off  errors  of  an  algorithm,  a distinction  explained  in  [12U] • 
The  truncation  error  exists  for  infinite  iterations,  and  may  be 
identified  with  x^  - A ^b!  its  behavior  has  been  examined  in  the 
above  survey,  under  the  assumption  that  there  was  no  round-off  error. 
The  study  of  round-off  error  itself  is  far  more  difficult,  and  there 
seems  to  have  been  a complete  discussion  in  connection  with  only  one 
method,  elimination!  see  von  Neumann  and  Goldstine  [12U,  h£]>  and 
also  Mulholland  [8E>].  For  other  studies  see  Bargmann,  Montgomery, 
and  von  Neumann  [6],  Dwyer  [20],  Satterthwaite  [102],  Thckennan 
[121],  and  Turing  [122]. 
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Any  approximate  solution  f of  Ax  ® b can  be  checked  a posteriori 
by  forming  the  residual  ? = b - A f . The  magnitude  of  A""^b  - I can 
then  be  estimated  by  using  some  tool  for  examining  errors  under  II. 
Hence  to  bound  the  errors  in  a calculated  c,  it  is  unnecessary  to 
have  an  a priori  knowledge  of  the  accumulation  of  round-off  error. 

Such  knowledge  may  be  important,  however,  for  planning  purposes  - for 
example , in  deciding  in  advance  how  many  digital  places  to  carry  in 
order  that  f be  reasonably  small. 

The  errors  under  H have  attracted  mere  study.  The  practical 
analyst,  realizing  that  the  elements  of  A and  b are  subject  to  uncer- 
tainty, wishes  to  know  the  corresponding  uncertainty  in  A ^bj  the 
latter  is  the  inherent  error  of  Milne  [81],  The  usual  approach  is  the 
approximate  one  of  bounding  the  principal  part  of  the  error, 

S (A-1b)  ® A "*■(  S A)A  + A ^Sb;  see  Blumenthal  [11] , Milne  [81], 
Moulton  [8U],  Ostrowski  [90],  Wittmeyer  [126],  and 
Zurmuhl  [131].  But  others  (e.g.,  Janet  [61*],  Ostrowski  [89],  Lon- 
seth  [??])  bound  the  entire  error,  finding  a region  S to  which  A ^b  is 

/w 

rigorously  confined,  given  f?  * b - A^  and  other  reasonably  computable 
quantities  associated  with  the  system.  See  also  Woodbury  [128], 

Various  persons  (e.g.,  Jurgens  [66],  Todd  [119],  and  Turing  [122]} 


have  attempted  to  ascribe  a condition  to  the  system  Ax  ■ b.  In  part, 
the  condition  should  measure  the  influence  on  A of  small  changes 
in  A and  b|  the  larger  the  change  in  A "Hd  for  given  changes  in  A and 
b,  the  "worse"  the  condition.  Although  the  condition  depends  on 
both  A and  b,  the  measures  hitherto  proposed  depend  only  on  A.  The 


p. 
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belief  is  widespread  that  the  condition  of  a system  (l)  has  a decisive 
influence  on  the  convergence  of  an  iterative  solution  and  on  the  ac- 
curacy of  a direct  solution^  this  cannot  always  be  true.  Even  when  it 
is  true  for  an  iterative  process,  it  may  be  possible  actually  to  take 
advantage  of  the  poor  condition  of  (l)  in  converting  the  slow  process 
into  an  accelerated  method  which  converges  rapidly.  There  is  great 
need  for  clarification  of  the  group  of  ideas  associated  with  "condition 
With  the  concept  of  "ill-conditioned"  systems  Ax  • b goes  the  idea 
of  "pre-conditioning"  than.  Gauss  [111]  and  Jacobi  [63]  made  early  con- 
trlbutions  to  this  subject.  That  of  Gauss  is  analyzed  and  extended  in 
[31]. 

4*/s J 

A convenient  means  of  pre-conditioning  is  to  premultiply  the 
system  with  a matrix  B,  so  that  one  has  to  solve 

(38)  BAx  - Bb  . 

—1  T 

The  perfect  choice  of  B would  be  A .A  frequent  choice  is  A , so 

T 

that  (38)  gets  a symmetric  matrix  A A,  very  convenient  for  many  pro- 
cesses, though  "worse  conditioned"  in  some  senses  (Taussky  [UJ?] ) • 

Is  there  in  ary  sense  a "best  choice"  of  B which  is  quickly  obtainable? 

Gauss’  elimination  process  may  be  written  in  the  form  (38),  where 
in  the  absence  of  round-off  errors  BA  is  a triangular  matrix.  In 
some  calculations  the  true  solution  A~ ^b  comes  from  iteration  of  the 
"back  solution"  - i.e.,  of  getting  xk+1  by  an  approximate  solution 
of  the  triangular  system  BA(x  - x^)  - B(b  - Ax^) . Where  this  occurs, 
we  may  interpret  the  "forward  solution"  or  triangularization  as 
merely  a preconditioning  of  the  system  (l)  into  the  form  (38)  • 
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9 • Influence  of  computing  equipment.  The  usefulness  of  a pro- 
cess for  solving  (l)  depends  intimately  on  the  properties  of  the 
machine  on  which  the  calculation  takes  place,,  as  well  as  on  the  special 
character  of  A and  b.  Hie  past  decade  has  seen  revolutionary  develop- 
ments in  computing  equipment:  analogue  machinery,  desk  computers,  IBM 
equipment  and  automatically  sequenced  high-speed  digital  computers • 

As  a result,  computing  methods  are  in  no  way  settled  down,  and  biblio- 
graphies are  out  of  date  before  publication. 

Analogue  machinery  can  be  very  useful,  but  is  not  discussed  herej 
for  references  see  Engineering  Research  Associates  [22],  Frame  [36], 
and  Murry  [86] „ 

While  direct  methods  for  solving  Ax  ® b have  been  little  men- 
tioned here,  they  have  been  very  successful  since  the  time  of  Gauss  or 
earlier.  Dwyer  [20],  Bodewig  [12]  and  others  conclude  that  a compact 
arrangement  of  Gaussian  elimination  is  commonly  the  best  method  for 
the  computing  team  of  a desk  machine,  a data  sheet,  and  a trained 
human  being  - principally  because  the  number  of  operations  is  minimized. 
Elimination  is  very  successful  with  IBM  machines  also,  but  its  superiority 
over  other  methods  is  less  pronounced,  because  it  is  seldom  expedient 
with  IBM  equipment  to  use  the  compact  arrangements  which  save  so  mary 
operations.  A bibliography  on  computing  methods  for  IBM  machinery  is 
given  in  [38a]. 

Let  us  now  consider  automatically  programmed  digital  computers 
like  the  SWAC.  These  are  much  faster  than  previous  computers,  but  the 
speed  of  arithmetic  operations  and  of  access  to  a small  store  of  data 
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(high-speed  memory)  has  been  accelerated  a great  deal  more  than  the 
operations  of  input,  output,  and  access  to  a large  store  of  data. 

The  resulting  change  in  the  relative  costs  of  different  operations  has 
a profound  effect  on  the  choice  of  computing  methods.  O'ne  soon  learns 
that  a variety  of  processes  have  been  tried  and  advocated  for  solving 
(l)s  certainly  the  optimal  method  depends  on  the  problem,  the  machine, 
the  operator,  and  the  coder.  Moreover,  small  changes  in  these  factors 
may  radically  alter  the  optimal  choice  of  method. 

The  following  tentative  ideas  are  based  on  a limited  experience 
with  the  ’SjiIAC,  and  practically  no  experience  with  other  machines. 

(The  analysis  is  dominated  by  the  relative  shortage  of  memory  cells  in 
the  SWACs  it  is  therefore  less  pertinent  for  machines  with  more  storage 
space,  and  for  the  'SWAG  after  the  expected  addition  of  a magnetic 
drum.)  Assume  n to  be  fairly  large  - say  = 15.  For  simplicity  we 
again  confine  our  attention  to  matrices  A > 0.  As  indicated  after 
(38) , the  forward  solution  in  elimination  amounts  to  building  up  a 
new  matrix  BA,  which  must  be  stored  somewhere.  If  BA  is  kept  in  the 
high-speed  memory,  it  occupies  critically  needed  space.  If  it  is 
output  and  input  as  needed  (say  by  rows,  as  Huskey  [61]  describes), 
the  programming  is  complicated  and  the  solution  is  considerably 
slowed.  If  A is  a matrix  which  can  be  generated  internally  as  needed 
(for  instance,  the  matrix  of  the  Laplace  difference  operator) , it 
requires  little  space,  and  BA  becomes  the  principal  item  to  be  stored. 
Where  A cannot  be  generated  internally,  the  storage  problem  gets 
still  worse,  because  the  round-off  errors  can  only  be  reduced  by 
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using  A to  compute  the  residual  r - b - Ax  from  time  to  time,  so  that 

r 

both  BA  and  A must  be  stored. 

These  considerations  suggest  that  a solution  process  should  pre- 
ferably work  with  only  one  matrix,  A itself,  and  should  require  rela- 
tively little  other  storage.  Since  the  instructions  have  to  be  stored, 
this  suggests  a process  of  simple  structure,  repeated  as  often  as 
necessary,  A process  seeming  to  require  a minimum  of  storage  is  the 
cyclic  single-step  procedure  mentioned  in  section  3j  besides  A (if  it 
must  be  stored),  one  need  store  only  one  vector,  x,  at  a time.  This 
method  was  picked  by  Reich  [96]  as  best  for  a digital  computer,  and 
is  undoubtedly  ideal  when  it  converges  fast  enough.  But  we  may  expect 
that  the  convergence  is  frequently  too  slow.  If  an  acceleration  of 
the  types  discussed  in  section  lj.  is  needed,  the  complication  in  pro- 
gramming may  make  another  procedure  preferable.  Another  method  of 
speeding  up  the  cyclic  single-step  method  is  by  appropriately  over- 
correcting (or  undercorrecting  ?)  at  each  step,  as  discussed  by 
Frankel  [38]  and  Young  [129]  for  special  systems  (l) . It  seans  likely 
that  a careful  determination  of  the  optimal  over correct! on  will  some- 
times provide  adequate  convergence,  but  that  it  will  often  fail. 

The  ordinary  relaxation  (i,e,,  piecewise  linear)  processes  require 
about  the  same  storage  as  the  cyclic  single-step  methods;  it  is  not 
clear  whether  they  are  essentially  faster  or  not,  A suggestion  of 
Motzkin  and  Schoenberg  [83]  for  extreme  overrelaxation  is  promising 
but  untried. 
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If  the  above  methods  fail,  one  can  switch  to  the  optimum  gradient 
method  of  section  This  also  works  with  A,  which  must  be  stored  or 
generated,  and  further  requires  the  storage  of  the  two  vectors  x^.  and 
r^,  (The  storage  of  can  be  avoided  if  Xj^  - x^.  is  output  at  each 
step,  and  cumulated  later.)  Again  the  method  is  probably  commonly  too 
slow.  It  can  be  speeded  tip  either  by  the  ft  -optimum  device  of  section 
5»,  for  ft  < 1,  or  by  Richardson's  idea  [98]  of  widely  varying 
*k  - ^ a k over  the  range  of  eigenvalues  of  A”"^, 

If  these  tricks  fail  or  require  too  complex  a program,  the  gra- 

i 

dient  methods  of  section  7 are  available.  Besides  A,  they  require 
the  storage  of  three  vectors  x^,  r^,  p^»  (As  above,  the  outputting 
of  x^..,.^  “ saves  storing  x^.)  Of  these  methods,  there  seems  to  be 
no  reason  for  not  adopting  the  optimum  gradient  method,  since  for  the 
same  storage  its  convergence  is  much  the  best.  Programming  is  simple, 
as  only  one  routine  is  needed;  all  necessary  variations  in  the 
are  provided  automatically,  A drawback  is  that,  since  the  oc ^ and 
other  numbers  vary  so  much  in  the  calculation,  It  is  difficult  to 
provide  scaling  factors  in  advance.  Consequently  one  uses  "floating 
binaiy  point"  operations,  requiring  considerable  memory  space  to 
hold  the  instructions  and  multiplying  the  solution  time  by  a factor 
which  varies  on  the  SWAG  from  the  order  of  £0  (when  A is  generated 
internally)  to  about  one  (when  A is  input  at  each  iterative  step) , 

But  the  method  has  prcved  able  to  cope  with  some  very  "badly  con- 
ditioned" matrices,  as  reported  by  Hestenes,  Hochstrasser,  and 
Wilson  [E>1],  and  Hochstrasser  [!?7].  It  probably  approaches  the 
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ideal  of  a machine  method  which  can  be  relied  on  to  work  automatically 
without  special  analysis  of  the “particular  system  (l) . 

With  any  method  the  partitioning  of  A may  greatly  increase  the 
speed  by  enabling  subsidiary  matrix  inversions  to  take  place  entirely 
within  the  high-speed  memory;  see  [21],  One  usually  thinks  of  Gaussian 

/'V'V 

elimination  on  the  submatrix  blocks.  Would  other  methods  on  the 
blocks  be  preferable? 
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